Alternatives to box plots: Using beeswarm and raincloud plots to summarise ecological data

Box plots are a common way to summarise data and its spread in ecology and biology research, but box plots have their weaknesses. Here we’ll show how to use {ggbeeswarm}, {ggdist} and {gghalves} to make beeswarm and raincloud plots quickly using plant trait data from {austraits}.

Eukaryota
Plantae
Summaries
Authors

Dax Kellie

Shandiya Balasubramanium

Published

May 16, 2023

Author

Dax Kellie
Shandiya Balasubramanium

Date

5 May 2023

In ecology and biology research, box plots are a wonderfully simple and efficient way to summarise data from different groups (e.g., species, populations, experimental conditions). However, this simplicity can also sometimes hide the underlying structure of data, unintentionally misleading readers.

Luckily, there are alternative ways to display data other than with box plots that have grown increasingly easy to make in R. This post will use an ecological lens to show more transparent ways to summarise data.

In our example, we’ll compare several species of plants using a leaf trait commonly used to predict a plant’s performance: Leaf dry mass per area (or more simply, the size of a leaf given its surface area). We’ll explain how best to add points to box plots, and how to create raincloud plots—one of the most transparent ways to display summary data.

Why not boxplots?

Here is one example of how box plots on their own (like the one on the left) can mask differences in sample size and underlying data structure (shown on the right). This example is taken from Cedric Scherer’s blog post on the same topic which I encourage you to check out.

Code
library(ggplot2)
library(dplyr)
library(glue)
library(ggtext)

# generate sample data
set.seed(2021)

data <- tibble(
  group = factor(c(rep("Group 1", 100), rep("Group 2", 250), rep("Group 3", 25))),
  value = c(seq(0, 20, length.out = 100),
            c(rep(0, 5), rnorm(30, 2, .1), rnorm(90, 5.4, .1), rnorm(90, 14.6, .1), rnorm(30, 18, .1), rep(20, 5)),
            rep(seq(0, 20, length.out = 5), 5))
  ) %>% 
  rowwise() |>
  mutate(value = if_else(group == "Group 2", value + rnorm(1, 0, .4), value))

## Left box plot
boxplot <- 
  ggplot(data, aes(x = group, y = value)) +
  geom_boxplot(fill = "grey92") + 
  pilot::theme_pilot() + 
  theme(axis.title.y = element_blank(),
        axis.title.x = element_blank())


## Right box plot

# find sample size and create x axis labels
sample_size <- data |>
  group_by(group) |>
  summarise(n = n())

sample_size_label <- glue("
                          {sample_size$group} <br>
                          **N = {sample_size$n}**
                          ")


boxplot_with_data <- 
  ggplot(data, aes(x = group, y = value)) +
  geom_boxplot(fill = "grey92") +
  geom_point(
    size = 2,
    alpha = .3,
    # add some jittering
    position = position_jitter(
      seed = 1, width = .2
    )) +
  pilot::theme_pilot() + 
  scale_x_discrete(labels = sample_size_label) +
  theme(axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_markdown())

boxplot
boxplot_with_data

To see how to summarise data more transparently, let’s download some data and see how adding data points and distributions can help improve our visualisations.

Download data

First let’s download some packages we’ll need.

# packages
library(tidyverse)
library(purrr)
library(here)

To download data we will use the {austraits} package to get data of Australian plant traits from AusTraits, an open-source database of nearly 500 traits across more than 30,000 taxa from field surveys, published papers, and taxonomic books.

We can download the entire austraits data set with the load_austraits() function.

Note

To make analyses more reproducible, load_austraits() requires users to specify a version. You can see available versions using the get_versions() function.

library(austraits)

austraits <- load_austraits(version = "4.1.0", 
                            path = here::here("posts", "data", "austraits"))

There’s a huge range of available plant traits, and if you are interested in seeing what else there is, you should check out their AusTraits Plant Dictionary for a complete list. For a brief breakdown of trait names and the data available for each, you can use summarise_austraits("trait_name"). Rather than showing the entire list, here’s a random sample to give you an idea.

# show a sample of available traits
austraits %>% 
  summarise_austraits("trait_name") |>
  slice_sample(n = 10) |>
  as_tibble()
# A tibble: 10 × 5
   trait_name                           n_records n_dataset n_taxa percent_total
   <chr>                                    <int>     <int>  <int>         <dbl>
 1 leaf_work_to_shear                         192         5    137     0.000153 
 2 leaf_lobation                             4309         5   4193     0.00344  
 3 soil_seedbank                              522         4    515     0.000417 
 4 leaf_cell_wall_N_per_cell_wall_dry_…        29         1     22     0.0000231
 5 seed_shape                                2965         8   2722     0.00237  
 6 fire_and_establishing                     1618         2   1586     0.00129  
 7 leaflet_area                               424         5     51     0.000338 
 8 flower_perianth_merism                     268         1    202     0.000214 
 9 water_potential_88percent_lost_cond…        81         2     79     0.0000646
10 wood_delta15N                              256         3     35     0.000204 

Leaf mass per area (LMA)

The plant trait data we’ll download is leaf dry mass per area (LMA), which measures how big and dense a leaf is compared to its surface area.

Here is a nice example from Butrim & Royer (2020). Notice that the small, dense leaf on the top has much higher leaf mass per area than the two large, light leaves on the bottom.

Leaves from 34-33 million years ago, taken from a post by the Botanical Society of America

Leaf mass per area is a nice indicator of each plant’s strategy for survival, and can be thought of on a spectrum:


High-LMA


Medium-LMA


Low-LMA

On one side, plants with small, dense leaves (High-LMA) acquire resources (like nutrients and energy) gradually and grow slowly, aiming to conserve what resources they have. These plants tend to have an advantage in unproductive environments (where they can efficiently store whatever limited resources are available).

On the other side, plants with big, light leaves (Low-LMA) acquire resources quickly and grow fast, aiming to out-compete others for the resources on offer. These plants tend to have an advantage in highly productive environments (where they can get lots of resources and use them quickly) (Poorter et al. 2009). They contribute to these ecosystems’ productivity, too, because they degrade faster and are preferred by herbivores, cycling nutrients into the ecosystem.

This relationship between LMA and the environment type a plant is better suited for allows us to infer what kind of ecosystem each plant lives in depending on its LMA (which we’ll try to do at the end).

Extract trait data

To download LMA trait data, we’ll use extract_trait(). This downloads a list of data and metadata about authors, collection methods, locations, taxonomic information and data structure. To only grab the data.frame we want from our list, we’ll use purrr::pluck().

Tip

You can use lookup_trait() to search for traits

# lookup_trait(austraits, "leaf_mass")

# Get trait data
leaf_mass <- austraits |> 
  extract_trait("leaf_mass_per_area") |> 
  pluck("traits")

leaf_mass
# A tibble: 27,946 × 24
   dataset_id  taxon_name      observation_id trait_name value unit  entity_type
   <chr>       <chr>           <chr>          <chr>      <dbl> <chr> <chr>      
 1 Ahrens_2019 Corymbia calop… 001            leaf_mass…  185. g/m2  individual 
 2 Ahrens_2019 Corymbia calop… 002            leaf_mass…  138. g/m2  individual 
 3 Ahrens_2019 Corymbia calop… 003            leaf_mass…  169. g/m2  individual 
 4 Ahrens_2019 Corymbia calop… 004            leaf_mass…  151. g/m2  individual 
 5 Ahrens_2019 Corymbia calop… 005            leaf_mass…  142. g/m2  individual 
 6 Ahrens_2019 Corymbia calop… 006            leaf_mass…  156. g/m2  individual 
 7 Ahrens_2019 Corymbia calop… 007            leaf_mass…  150. g/m2  individual 
 8 Ahrens_2019 Corymbia calop… 008            leaf_mass…  175. g/m2  individual 
 9 Ahrens_2019 Corymbia calop… 009            leaf_mass…  148. g/m2  individual 
10 Ahrens_2019 Corymbia calop… 010            leaf_mass…  135. g/m2  individual 
# ℹ 27,936 more rows
# ℹ 17 more variables: value_type <chr>, basis_of_value <chr>,
#   replicates <chr>, basis_of_record <chr>, life_stage <chr>,
#   population_id <chr>, individual_id <chr>, temporal_id <chr>,
#   source_id <chr>, location_id <chr>, entity_context_id <chr>, plot_id <chr>,
#   treatment_id <chr>, collection_date <chr>, measurement_remarks <chr>,
#   method_id <chr>, original_name <chr>

Sample species

Let’s get a sample of our 7 species to summarise. To make sure that our sample species have enough data points to summarise, let’s filter leaf_mass to only taxa with at least 10 data points. To do this, we’ll count the number of data points by taxon_name, filter to only those with 10 data points or more, and use pull() to extract them. We’ll save this list of names as filtered_names.

# get species with more than 10 records
filtered_names <- leaf_mass |>
  group_by(taxon_name) |>
  count() |>
  filter(n >= 10) |>
  pull(taxon_name)

filtered_names |> head(10L)
 [1] "Acacia acinacea"       "Acacia acuminata"      "Acacia anceps"        
 [4] "Acacia aneura"         "Acacia argyrophylla"   "Acacia aulacocarpa"   
 [7] "Acacia auriculiformis" "Acacia baileyana"      "Acacia bidwillii"     
[10] "Acacia brachybotrya"  

Next we’ll filter our leaf_mass data to only those in filtered_names.

leaf_mass_filtered <- leaf_mass |>
  filter(taxon_name %in% filtered_names)

Filtering out these taxa has removed more than 3,000 taxa from our dataset.

Code
n_taxa <- leaf_mass |> distinct(taxon_name) |> count()
n_taxa_filtered <- leaf_mass_filtered |> distinct(taxon_name) |> count()

n_taxa - n_taxa_filtered
     n
1 3196

Next, let’s get a sample of these taxa to plot. One way to get a random sample of 7 taxa is to use filter to return data of only 7 unique taxa names.

leaf_mass_filtered |> 
  filter(taxon_name %in% sample(unique(taxon_name), 7))

Here’s a random sample of taxa we prepared earlier, which we’ll specify just so we can look at some extra maps after we summarise our plant trait data as well. We’ll call our sample leaf_mass_sample.

sample_names <- c("Cryptocarya rigida", "Pteridium esculentum", 
                  "Eucalyptus baxteri", "Melaleuca armillaris",
                  "Eucalyptus wandoo", "Pentameris airoides",
                  "Eucalyptus piperita")

leaf_mass_sample <- leaf_mass_filtered |>
  filter(taxon_name %in% sample_names)

Beeswarm plot

The first plot we’ll learn to make is a beeswarm plot using the {ggbeeswarm} package. Beeswarm plots are useful because they allow you to plot points next to each other that would normally overlap. These plots are especially nice because the points are plotted in a way that visualises density much like a violin plot while still showing each individual point.

Let’s load {ggbeeswarm} and try plotting our plant species’ trait values using geom_quasirandom(), which uses a sequencing algorithm to place points nicely next to each other.

library(ggbeeswarm)

ggplot(data = leaf_mass_sample, 
       aes(x = taxon_name, 
           y = value, 
           colour = taxon_name)) +
  ggbeeswarm::geom_quasirandom(size = 2)

Now let’s take a few steps to clean this plot up. We’ll use one of our favourite theme packages, {pilot}, to get some nice colours and make the plot look neater. We’ll also use stringr::str_wrap() and reorder() to wrap the long species names on the x-axis and order our species in ascending order.

library(pilot) # remotes::install_github("olihawkins/pilot")
library(stringr)

ggplot(data = leaf_mass_sample, 
       aes(x = taxon_name |> stringr::str_wrap(10) |> reorder(value), 
           y = value, 
           colour = taxon_name)) +
  ggbeeswarm::geom_quasirandom(size = 2,      # size of points
                               width = .3,    # width of points
                               alpha = .7) +  # opacity of points
  labs(y = "Leaf mass per area (g/m<sup>2</sup>)",
       x = "Species") +
  pilot::scale_color_pilot() +
  pilot::theme_pilot(grid = "h",
                     axes = "b") +
  theme(legend.position = "none",
        axis.title.y = ggtext::element_markdown()) # allows html in axis title

Because {ggbeeswarm} places points in a semi-random but still calculated spot, the appearance is less confusing to readers than placing points randomly using other functions like geom_jitter().

For extra readability, we can add a box plot along with our points to help summarise their spread, too. To help make our boxes stand out while still maintaining our colour palette, we’ll use the {colorspace} package to lighten each species’ colour.

The {colorspace} package is a toolbox designed to select clear colour palettes while accounting for visual colour deficiencies. To use {colorspace} to adjust colours after calculating each box, we’ll use after_scale() and lighten() on our fill and colour arguments in geom_boxplot().

ggplot(data = leaf_mass_sample, 
       aes(x = taxon_name |> stringr::str_wrap(10) |> reorder(value), 
           y = value, 
           colour = taxon_name, 
           fill = taxon_name)) +
  geom_boxplot(
    aes(fill = taxon_name,
        fill = after_scale(colorspace::lighten(fill, .9)),
        colour = taxon_name,
        colour = after_scale(colorspace::lighten(colour, .3))),
    size = 1,
    outlier.shape = NA
    ) +
  ggbeeswarm::geom_quasirandom(size = 4, 
                               width = .25, 
                               alpha = .7) +
  # scale_y_continuous(expand=c(0,0)) +
  scale_y_continuous(breaks = c(0, 100, 200, 300, 400),
                     labels = c(0, 100, 200, 300, 400),
                     limits = c(0, 400),
                     expand = c(0,0)) +
  labs(y = "Leaf mass per area (g/m<sup>2</sup>)",
       x = "Species") +
  pilot::scale_color_pilot() +
  pilot::scale_fill_pilot() +
  pilot::theme_pilot(grid = "h",
                     axes = "b") + 
  theme(legend.position = "none",
        axis.title.y = ggtext::element_markdown(),
        axis.text.x = element_text(face = "italic"))

Raincloud plots

In 2019, Raincloud plots were proposed as one way to visualise data with “maximal statistical information while preserving the desired ‘inference at a glance’ nature of barplots and other similar visualization devices.” By displaying points, densities and summary stats like median, mean and confidence intervals, raincloud plots are robust, informative, and (dare I say it) stunning.

Let’s plot the same data of our 7 species above as a raincloud plot. We’ll load the {ggdist} package and the {gghalves} package to allow us to plot different parts of our raincloud plot.

library(ggdist)
library(gghalves)

To begin, let’s plot the distribution on one half of our raincloud plot using stat_halfeye(), and the points that make up that distribution on the other half of our plot with geom_half_point(). We’ll also flip our plot on its side using coord_flip() to give our plot and labels more space (and to eventually help make our raincloud effect)1.

ggplot(data = leaf_mass_sample, 
       aes(x = taxon_name |> stringr::str_wrap(10) |> reorder(value), 
           y = value, 
           colour = taxon_name, 
           fill = taxon_name)) +
  ggdist::stat_halfeye(adjust = .4,           # smoothness of distribution
                       width = .87,           # height of distribution
                       colour = NA) +
  gghalves::geom_half_point(side = "l",       # choose right or left side
                            range_scale = .3, # spread of points
                            alpha = .6,
                            size = 2.2) +
  coord_flip() +
  labs(x = "Species",
       y = "Leaf mass to area")

Now we can add our boxplot, using our colorspace trick with the aes() argument of geom_boxplot() to darken the lines of each box. We’ll also add use {pilot} to add some nice colours and styling to our plot.

And just like that we can make it rain!2

ggplot(data = leaf_mass_sample, 
       aes(x = taxon_name |> stringr::str_wrap(10) |> reorder(value), 
           y = value, 
           colour = taxon_name, 
           fill = taxon_name)) +
  ggdist::stat_halfeye(adjust = .4,
                       width = .87,
                       colour = NA) +
  gghalves::geom_half_point(side = "l",
                            range_scale = .3,
                            alpha = .6,
                            size = 2.2) +
  geom_boxplot(
    aes(colour = taxon_name,
        colour = after_scale(colorspace::darken(colour, .7))),
    width = .12,        # adjust box width
    fill = NA,
    size = 1.1,         # size of box line
    outlier.shape = NA  # remove outlier points
    ) +
  coord_flip() +
  labs(x = "Species",
       y = "Leaf mass per area (g/m<sup>2</sup>)") +
  scale_y_continuous(breaks = c(0, 100, 200, 300, 400),
                     labels = c(0, 100, 200, 300, 400),
                     limits = c(0, 400),
                     expand = c(0,0)) +
  pilot::scale_color_pilot() +
  pilot::scale_fill_pilot() +
  pilot::theme_pilot(grid = "",
                     axes = "b") + 
  theme(legend.position = "none",
        axis.title.x = ggtext::element_markdown(),
        axis.text.y = element_text(face = "italic"))

Bonus: Plant distribution

A plant’s leaf mass per area (LMA) can tell us a lot about the types of ecosystems that these plants live in. By mapping each plant’s spatial distributions, we can get an idea of the broader range of these ecosystems across Australia.

Let’s see where our plant species that we summarised above live!

Remember, low-LMA species like Cryptocarya rigida have big, light leaves that thrive in wetter ecosystems with lots of nutrients. Alternatively, high-LMA species like Eucalyptus wandoo have small, dense leaves that thrive in drier ecosystems with little nutrients.

Is this where you expected these plants to be found?

Code
library(galah)
library(sf)
library(ozmaps)

# Download data
galah_config(email = "dax.kellie@csiro.au", verbose = FALSE)

plants <- galah_call() |>
  galah_identify(sample_names) |>
  galah_apply_profile(ALA) |>
  atlas_occurrences()

# Recategorise subspecies into species categories
plants <- plants |>
  drop_na(decimalLatitude, decimalLatitude) |>
  mutate(names = case_when(
    str_detect(scientificName, "Eucalyptus wandoo") ~ "Eucalyptus wandoo",
    str_detect(scientificName, "Pentameris airoides") ~ "Pentameris airoides",
    str_detect(scientificName, "Melaleuca armillaris") ~ "Melaleuca armillaris",
    str_detect(scientificName, "Pteridium esculentum") ~ "Pteridium esculentum",
    .default = scientificName)
    )

# Australia map
aus <- ozmaps::ozmap_country |>
  st_transform(crs = st_crs(4326))

# Map points
ggplot() +
  geom_sf(data = aus,
          colour = "grey60",
          fill = NA) +
  geom_point(data = plants,
             aes(x = decimalLongitude,
                 y = decimalLatitude,
                 colour = names),
             shape = 16,
             alpha = 0.4) +
  pilot::scale_color_pilot() +
  pilot::theme_pilot() +
  coord_sf(xlim = c(110, 155), 
           ylim = c(-45, -10)) +
  facet_wrap( ~ names, ncol = 3) + 
  theme(legend.position = "none",
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.border = element_rect(linewidth = 1, colour = "grey90", fill = NA))

Final thoughts

Box plots are great because they are simple and can be interpreted quickly. These days, there are alternative options that are still simple but more robust and transparent. We hope you feel inspired to try out a beeswarm or raincloud plot yourself! (Maybe you even learned a little something about plants and their leaves as well)

There might be other statistics you might like use to summarise your data. It’s possible to add means and confidence intervals to rain cloud plots, too, though methods differ slightly to other visualisation tools3. Experiment with what works best with your data.

For other great resources on alternatives to box and bar plots in R, check out this article and this article by Cedric Scherer which we found really helpful.

Expand for session info
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16 ucrt)
 os       Windows 10 x64 (build 19045)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_Australia.utf8
 ctype    English_Australia.utf8
 tz       Australia/Sydney
 date     2023-08-23
 pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 austraits   * 2.1.2   2023-07-11 [1] Github (traitecoevo/austraits@53b637c)
 dplyr       * 1.1.2   2023-04-20 [1] CRAN (R 4.2.3)
 forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.2.2)
 galah       * 1.5.3   2023-07-01 [1] CRAN (R 4.2.2)
 ggbeeswarm  * 0.7.2   2023-04-29 [1] CRAN (R 4.2.3)
 ggdist      * 3.3.0   2023-05-13 [1] CRAN (R 4.2.3)
 gghalves    * 0.1.4   2022-11-20 [1] CRAN (R 4.2.3)
 ggplot2     * 3.4.2   2023-04-03 [1] CRAN (R 4.2.3)
 ggtext      * 0.1.2   2022-09-16 [1] CRAN (R 4.2.2)
 glue        * 1.6.2   2022-02-24 [1] CRAN (R 4.2.1)
 here        * 1.0.1   2020-12-13 [1] CRAN (R 4.2.1)
 htmltools   * 0.5.5   2023-03-23 [1] CRAN (R 4.2.3)
 lubridate   * 1.9.2   2023-02-10 [1] CRAN (R 4.2.2)
 ozmaps      * 0.4.5   2021-08-03 [1] CRAN (R 4.2.1)
 pilot       * 4.0.0   2022-07-13 [1] Github (olihawkins/pilot@f08cc16)
 purrr       * 1.0.1   2023-01-10 [1] CRAN (R 4.2.2)
 readr       * 2.1.4   2023-02-10 [1] CRAN (R 4.2.2)
 RefManageR  * 1.4.0   2022-09-30 [1] CRAN (R 4.2.3)
 sessioninfo * 1.2.2   2021-12-06 [1] CRAN (R 4.2.1)
 sf          * 1.0-9   2022-11-08 [1] CRAN (R 4.2.2)
 stringr     * 1.5.0   2022-12-02 [1] CRAN (R 4.2.2)
 tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.2.3)
 tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.2.2)
 tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.2.2)

 [1] C:/Users/KEL329/R-packages
 [2] C:/Users/KEL329/AppData/Local/Programs/R/R-4.3.1/library

──────────────────────────────────────────────────────────────────────────────

Footnotes

  1. Are plots with bars or boxes any easier to read when oriented up and down (ie. vertically) than side-to-side (ie. horizontally)? If you think about it, it’s slightly strange how deep-set this trend is. Flipping your plot can make longer names easier to read, make differences easier to see when stacked in order, and give space to other elements in your plot!↩︎

  2. Fun fact: “How to make it rain” is an actual subheading in the paper↩︎

  3. If you would like to add mean and confidence intervals (rather than median and quantile intervals) to your raincloud plot, you can use stat_pointinterval() to do this. However, {ggdist} calculates confidence intervals using a Bayesian method to find a point estimate and Higher Density Interval (HDI). This method calculates different estimates to Frequentist confidence intervals, so it’s worth looking up the difference before using HDis. If you are plotting estimates after running a model, the suggested way by the creator of {ggdist} is to pass this information using dist_student_t() and model parameters from broom::tidy() output. This stack overflow thread is also helpful.↩︎